Estimating the density of deep eutectic solvents applying supervised machine learning techniques

Deep eutectic solvents (DESs) have recently been synthesized to overcome the limitations of conventional solvents. These green solvents have a wide range of potential uses in real-life applications. Precise measurement or accurate estimation of the thermophysical properties of DESs is a prerequisite for their successful application. Density is likely the property that most strongly affects the solvation ability of DESs. This study utilizes seven machine learning techniques to estimate the density of 149 deep eutectic solvents. The density is estimated as a function of temperature, critical temperature, critical pressure, and acentric factor. The LSSVR (least-squares support vector regression) presents the highest accuracy among 1530 constructed intelligent estimators. The LSSVR predicts 1239 densities with a mean absolute percentage error (MAPE) of 0.26% and R² = 0.99798. Comparing the LSSVR with four empirical correlations revealed that the former possesses the highest accuracy level. The prediction accuracy of the LSSVR (i.e., MAPE = 0.26%) is 74.5% better than the best result obtained by the empirical correlations (i.e., MAPE = 1.02%).


Laboratory-measured datasets
The objective of the current study is to construct an intelligent tool that approximates the density of deep eutectic solvents precisely. Like regression-based correlations [23], all intelligent methods need a laboratory-measured database to adjust their parameters and test their prediction reliability [24,25]. Thus, 1239 experimentally measured density data points for deep eutectic solvents have been gathered from thirty references and used in the model development/validation stage. A summary of the collected density data is presented in Table 1. This table lists the hydrogen bond donors (HBD) and hydrogen bond acceptors (HBA) of the considered deep eutectic solvents. As Table 1 shows, the gathered databank includes thirteen HBA and forty-two HBD ingredients. The table also gives the number of measurements and the ranges of the working temperature and measured density.
Table 1. Summary of the reported laboratory-measured density for diverse deep eutectic solvents in the literature.
www.nature.com/scientificreports/
Critical pressure, critical temperature, and acentric factor. This study aims to build a single model to estimate the density of 149 different deep eutectic solvents. Therefore, it is mandatory to include inherent characteristics of these materials in the list of independent variables. The three-parameter corresponding states theory explains that each material has its own specific acentric factor, critical temperature, and critical pressure [26]. Hence, these parameters could help the machine learning method distinguish different deep eutectic solvents and discriminate among their density values [27]. Haghbakhsh et al. [28] utilized the improved Lydersen-Joback-Reid group contribution method [12] and the Lee-Kesler mixing rules [29] to estimate the acentric factor and critical temperature/pressure of different deep eutectic solvents. Table 2 presents the range of these inherent characteristics for all considered deep eutectic solvents [28]. The supplementary Excel files include the entire experimental databank utilized in the current study.
To reduce the table size, the reported values are grouped by the hydrogen bond acceptor type of the deep eutectic solvents. Specific values of the acentric factor, critical temperature, and critical pressure for each deep eutectic solvent can be found in the article by Haghbakhsh et al. [28].

Estimation scenarios for density of deep eutectic solvents
The literature has suggested several empirical correlations for estimating liquid density. In addition, the current study applies seven machine learning methods to estimate the density of 149 deep eutectic solvents. The mathematical formulation and background of the available empirical correlations and machine learning methods are briefly reviewed in this section.

Empirical correlations. Rackett correlation. Rackett's correlation is likely the first equation developed to calculate the density of a saturated liquid [54]. As Eq. (1) explains, the molar volume (ν) is estimated as a function of temperature (T) and the critical pressure (Pc), critical molar volume (νc), and critical temperature (Tc). R and Tr denote the gas constant and the reduced temperature (Eq. 2), respectively.
Equation (3) can then be used to obtain the density (ρ) from the molecular weight (M) and the estimated molar volume. Although Rackett's correlation was initially suggested for the density of saturated liquids, it has also given good predictions for deep eutectic solvents [55].
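The display equations (1) to (3) did not survive in this copy of the text. As a minimal sketch, the block below uses the widely quoted textbook form of Rackett's equation, ν = νc · Zc^((1 − Tr)^(2/7)) with Zc = Pc·νc/(R·Tc), together with ρ = M/ν. Whether this matches the paper's Eq. (1) term by term is an assumption, and the function names and the numerical inputs are hypothetical, chosen only for illustration.

```python
# Sketch of a Rackett-type density estimate (assumed textbook form, SI units).
R = 8.314462618  # universal gas constant, J/(mol*K)


def rackett_molar_volume(T, Tc, Pc, nu_c):
    """Saturated-liquid molar volume in m^3/mol; intended for T <= Tc."""
    Tr = T / Tc                      # reduced temperature
    Zc = Pc * nu_c / (R * Tc)        # critical compressibility factor
    return nu_c * Zc ** ((1.0 - Tr) ** (2.0 / 7.0))


def density_from_molar_volume(M, nu):
    """Density in kg/m^3 from molecular weight M (kg/mol) and molar volume."""
    return M / nu


# Hypothetical inputs, NOT taken from the paper's databank.
Tc, Pc, nu_c, M = 650.0, 3.0e6, 3.5e-4, 0.100
rho_300 = density_from_molar_volume(M, rackett_molar_volume(300.0, Tc, Pc, nu_c))
rho_350 = density_from_molar_volume(M, rackett_molar_volume(350.0, Tc, Pc, nu_c))
```

With a critical compressibility factor below one, this form gives a molar volume that grows with temperature, so the estimated density falls as the liquid is heated.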
Spencer and Danner correlation. Spencer and Danner incorporated a base molar volume measurement (ν_ref) at a base temperature (T_ref) into Rackett's correlation [56]. Equations (4) and (5) introduce the modified Rackett model, i.e., the Spencer and Danner correlation. It can be seen that all empirical correlations utilize the temperature and inherent characteristics of the material (a combination of νc, Pc, Tc, and ω) to formulate the liquid's density. Since the first three inherent properties (Tc, Pc, and νc) are related through the following equation, it is unnecessary to utilize all of them.
Therefore, the current study only utilizes temperature, Tc, Pc, and ω to estimate the DES density employing different intelligent estimators (Eq. 11).

Computational intelligent methods. A wide range of supervised and unsupervised artificial intelligence techniques have been suggested and applied in different modeling studies [58-63]. The working procedures of the machine learning methods used here, i.e., least-squares support vector regression (LSSVR), the hybrid neuro-fuzzy system, and five types of artificial neural networks, are briefly explained in this section.
Least-squares support vector regression. This intelligent estimator employs a kernel function (linear, Gaussian, or polynomial) to map the original independent variables (ξ) into a multi-dimensional computational domain. The following equation defines these functions.
The superscript T denotes the transpose operation. In addition, ε, σ, and δ are the kernel-related parameters. It is then possible to linearly relate the dependent (γ) to the independent (χ) variables in this new computational domain utilizing Eq. (13).
In Eq. (13), γ_LSSVR represents the target estimated by the least-squares support vector regression. Furthermore, w and b are adjustable coefficients of this intelligent model. In summary, the kernel type is the main topology feature of the LSSVR and should be determined by a practical procedure such as trial and error [64].
The detailed working process of the least-squares support vector machine has recently been explained by Nabavi et al. [64].
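To make the LSSVR description more concrete, the following is a minimal sketch of least-squares support vector regression with a Gaussian kernel, written in its usual dual form, where the bias b and the dual weights are found by solving a single linear system. It is not the authors' implementation. The kernel convention exp(−‖ξ − ξ'‖² / (2σ²)), the regularization parameter `reg`, and the synthetic four-input data are all assumptions introduced for illustration.

```python
import numpy as np


def gaussian_kernel(A, B, sigma):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    sq_dist = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    # Clip tiny negative values caused by round-off before exponentiating.
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * sigma**2))


def lssvr_fit(X, y, sigma=1.0, reg=1.0e4):
    """Solve the LS-SVM linear system for the bias b and dual weights alpha."""
    n = X.shape[0]
    system = np.zeros((n + 1, n + 1))
    system[0, 1:] = 1.0                      # constraint: sum(alpha) = 0
    system[1:, 0] = 1.0                      # bias column
    system[1:, 1:] = gaussian_kernel(X, X, sigma) + np.eye(n) / reg
    rhs = np.concatenate(([0.0], y))
    solution = np.linalg.solve(system, rhs)
    return solution[0], solution[1:]


def lssvr_predict(X_train, b, alpha, X_new, sigma=1.0):
    """Prediction: kernel-weighted sum over the training points plus bias."""
    return gaussian_kernel(X_new, X_train, sigma) @ alpha + b


# Synthetic stand-in for the four inputs (T, Tc, Pc, omega), scaled to [0, 1].
rng = np.random.default_rng(0)
X_demo = rng.uniform(0.0, 1.0, size=(40, 4))
y_demo = 1.2 - 0.3 * X_demo[:, 0] + 0.1 * X_demo[:, 1] * X_demo[:, 2]

b_demo, alpha_demo = lssvr_fit(X_demo, y_demo, sigma=1.0, reg=1.0e4)
y_fit = lssvr_predict(X_demo, b_demo, alpha_demo, X_demo, sigma=1.0)
```

In a trial-and-error search of the kind described above, `sigma` and `reg` would be varied and the accuracy on held-out data compared; the values used here are arbitrary.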
Artificial neural networks. This neuron-based machine learning method is the most widely used tool as either an estimator [65,66] or a classifier [67]. The working process of the artificial neural network is handled by a combination of linear (LPart) and non-linear (NLPart) operations conducted by the neuron as follows [68]:
w, b, and φ are the weight and bias coefficients and the activation function, respectively. Although a linear activation function exists, non-linear, continuous, and differentiable ones often provide artificial neural networks with a better generalization ability [69]. Equation (16) [...]
Hybrid neuro-fuzzy systems. The idea of combining the artificial neural network [76,77] and fuzzy logic [78,79] has resulted in a new class of machine learning, namely the adaptive neuro-fuzzy inference system [80,81]. This method estimates a target response employing five successive layers (i.e., fuzzification, rule, normalization, defuzzification, and output) [82]. Shojaei et al. have comprehensively described the mathematical operations performed in each layer of the adaptive neuro-fuzzy inference system [82]. The membership function utilized in the fuzzification layer [83], the number of clusters [80], the cluster radius [84], and the training algorithm [25] are the main structural features, and they are often tuned by trial and error.
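Returning to the single-neuron operation described under artificial neural networks, the sketch below shows the usual reading of the linear and non-linear parts: a weighted sum of the inputs plus a bias, followed by an activation function. The equations themselves are missing from this copy, so this form is an assumption, as are the choice of tanh for φ and the input values.

```python
import numpy as np


def neuron(x, w, b, phi=np.tanh):
    """One artificial neuron: linear part followed by the activation phi."""
    linear_part = np.dot(w, x) + b       # LPart: weighted sum plus bias
    return phi(linear_part)              # NLPart: activation of the linear part


# Hypothetical inputs and weights, for illustration only.
out_demo = neuron(x=[1.0, 2.0], w=[0.5, -0.25], b=0.0)
```

Passing `phi=lambda z: z` gives the linear activation mentioned in the text.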

Results and discussions
This section explains the procedure followed to choose the best intelligent method for estimating the DES density and to determine its structural features. The accuracy of this approach is then compared with that of the correlations available in the literature. Several numerical and graphical analyses are also employed to further examine the accuracy of the best model for predicting the density of deep eutectic solvents.
Constructing intelligent models. Topology determination. The topology of machine learning methods is often determined by trial and error [85-87]. This practical procedure changes the core features of a machine learning scheme and monitors its accuracy in the different stages of model development [88-90]. Table 3 specifies the core features of the considered intelligent techniques and the range investigated for each during the trial-and-error procedure. The literature has shown that artificial neural networks with one hidden layer are accurate enough to simulate a wide range of problems [72,91-93]. Consequently, the multilayer perceptron (MLP), recurrent (RNN), cascade feedforward (CFF), general regression (GR), and radial basis function (RBF) neural networks have been constructed with only one hidden layer.
Selecting the best topology of the intelligent methods. The core features of the machine learning methods have been varied according to the values reported in Table 3, both the training and testing stages have been performed, and the accuracy has been monitored utilizing several statistical indexes. Various uncertainty criteria, including MAPE (mean absolute percentage error), RMSE (root mean square error), RAE (relative absolute error), MAE (mean absolute error), and R² (coefficient of determination), have been used to monitor the accuracy of the developed intelligent models and to select the most precise ones.
Equations (17) to (21) express the mathematical forms of the MAPE, MAE, RAE, RMSE, and R², respectively. These equations only need the actual (ρ_exp), predicted (ρ_pred), and average experimental (ρ_exp^ave) density values and the number of data points (n) to measure the accuracy of any constructed model.
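Because Eqs. (17) to (21) are not reproduced in this copy, the sketch below computes the five indexes from their commonly used definitions. Treat these definitions, in particular the RAE as the summed absolute error divided by the summed absolute deviation from the mean (in percent), as assumptions rather than the paper's exact formulas. The two-point example is made up.

```python
import numpy as np


def accuracy_indexes(rho_exp, rho_pred):
    """Return MAPE (%), MAE, RAE (%), RMSE, and R2 from common definitions."""
    rho_exp = np.asarray(rho_exp, dtype=float)
    rho_pred = np.asarray(rho_pred, dtype=float)
    error = rho_exp - rho_pred
    deviation = rho_exp - rho_exp.mean()     # spread around the average value
    return {
        "MAPE": 100.0 * np.mean(np.abs(error) / rho_exp),
        "MAE": np.mean(np.abs(error)),
        "RAE": 100.0 * np.sum(np.abs(error)) / np.sum(np.abs(deviation)),
        "RMSE": np.sqrt(np.mean(error**2)),
        "R2": 1.0 - np.sum(error**2) / np.sum(deviation**2),
    }


# Made-up values, only to exercise the function.
idx_demo = accuracy_indexes([100.0, 200.0], [110.0, 190.0])
```

All five quantities depend only on the experimental values, the predictions, and the number of points, which matches the inputs listed in the text.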
The most precise density estimations obtained by each machine learning method are reported in Table 4. The accuracy monitoring indicates that 1) the Gaussian function is the best kernel for the LSSVR, 2) eleven hidden neurons is the best choice for the MLP, 3) ten hidden neurons provide the CFF with the best performance, 4) a spread factor of 0.04312 and 1053 hidden neurons should be used in the GR structure, 5) the RBF network is best constructed with a spread factor of 1.0526 and eleven hidden neurons, and 6) the ANFIS (adaptive neuro-fuzzy inference system) with the subtractive clustering membership function, twelve clusters, and the hybrid training algorithm has the best performance.
Although all these prediction accuracies confirm a high level of consistency with the laboratory-measured density, the LSSVR and the RBF neural network present the most and least precise results, respectively. To confirm this claim systematically, the subsequent analysis ranks the selected intelligent models based on their prediction accuracy in the different stages of model development.
Selecting the best intelligent model using the ranking analysis. The ranking analysis is a well-established procedure for ordering several models based on their performance. The previous step measured the prediction ability of the seven selected intelligent models using five well-known statistical indexes. The ranking analysis now uses the numerical values of these statistical indexes to order the models from best to worst. Equation (22) indicates that the selected models have been ranked based on their average rankings over the five statistical criteria (indx).
This ranking analysis has been applied separately to the models' performance during the learning and testing stages. Furthermore, the rank orders of the chosen intelligent models have also been tracked over the whole set of 1239 data points. Figure 1 displays the rank order of the LSSVR, the artificial neural network models (i.e., MLP, RNN, RBF, CFF, and GR), and the ANFIS over these three data subsets (learning, testing, and the whole database). It can be easily inferred that the LSSVR, which takes first place in all three rankings, and the RBF neural network, which takes seventh place in all three, are the best and worst tools for calculating the density of deep eutectic solvents. The ranking order of the other constructed models is also displayed in this figure.
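The ranking step described above can be sketched as follows: each model is ranked separately on every statistical index (1 = best), and the ranks are then averaged per model. This is only an illustration of the averaging idea behind Eq. (22). The three model names and all index values are made up, and ties are not given any special treatment.

```python
import numpy as np


def average_rank(scores, higher_is_better):
    """scores: (n_models, n_indexes) array; returns the mean rank per model."""
    scores = np.asarray(scores, dtype=float)
    ranks = np.empty_like(scores)
    for j in range(scores.shape[1]):
        # Flip the sign for indexes where a larger value is better (e.g., R2).
        column = -scores[:, j] if higher_is_better[j] else scores[:, j]
        order = np.argsort(column, kind="stable")
        ranks[order, j] = np.arange(1, scores.shape[0] + 1)
    return ranks.mean(axis=1)


# Columns: MAPE, MAE, RAE, RMSE, R2. Hypothetical numbers for three models.
models = ["model_A", "model_B", "model_C"]
scores_demo = [
    [0.3, 3.0, 5.0, 6.0, 0.998],
    [0.9, 9.0, 14.0, 13.0, 0.990],
    [0.5, 6.0, 9.0, 9.0, 0.995],
]
avg_demo = average_rank(scores_demo, [False, False, False, False, True])
best_to_worst = [models[i] for i in np.argsort(avg_demo)]
```

The same function can be applied separately to the learning, testing, and whole-database indexes to obtain the three rankings shown in Fig. 1.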
In summary, the LSSVR equipped with the Gaussian kernel function is the most trustworthy model for calculating the density of deep eutectic solvents from temperature and the inherent characteristics (i.e., ω, Tc, and Pc) of the involved substance. This highly accurate model [...] Fig. 2. The observed results confirm that the LSSVR is the most accurate tool for estimating the density of deep eutectic solvents. The LSSVR estimates 1239 density samples of 149 deep eutectic solvents with a MAPE of 0.26%, while the most accurate empirical correlation (the Spencer and Danner model) yields a MAPE of 1.02% for the same database. The suggested LSSVR thus reduces the best previously achieved MAPE by more than 74%.
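As a quick check of the quoted improvement, the relative reduction in MAPE can be worked out from the two reported values. The relative-improvement formula used here is the usual one and is assumed, since the text does not state it explicitly.

```latex
\frac{\mathrm{MAPE}_{\text{correlation}} - \mathrm{MAPE}_{\text{LSSVR}}}{\mathrm{MAPE}_{\text{correlation}}} \times 100
  = \frac{1.02 - 0.26}{1.02} \times 100 \approx 74.5\%
```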
Validation using graphical inspections. The densities estimated by the LSSVR (ρ_LSSVR) are plotted against their experimental counterparts (i.e., a cross-plot) in Fig. 3. This cross-plot presents the LSSVR predictions for the learning and testing steps separately. Two straight lines associated with relative deviation percent (RD%) values of −2% and +2% have also been added to this figure. Equation (23) expresses the formula of the RD%. Figure 3 shows that only about ten density samples have been estimated with an RD% lower than −2% or higher than +2%. This observation confirms the excellent ability of the built LSSVR to estimate the density of deep eutectic solvents.
The kernel density estimation is a reliable method for visually inspecting the compatibility between a given variable's actual and anticipated values. As Fig. 4 shows, this method depicts the cumulative distribution function (CDF) as a function of the experimental values of a given variable. Figure 4A-C illustrate the compatibility between actual and anticipated density values over the training and testing subdivisions and the whole database. Excluding the intermediate values of the DES's density, a remarkable consistency can be seen between actual and predicted values. Moreover, it can be detected that both the experimental data and the LSSVR predictions have a standard Gaussian distribution shape.
The magnitude of the difference between actual and predicted densities (the residual error, i.e., RE) is another statistical index applied to monitor the prediction accuracy of the built LSSVR. The mathematical expression of the RE is given in Eq. (24).
Based on the results reported in Fig. 5, 61% of the available samples have been estimated with a residual error of less than 2 kg/m³. Moreover, the LSSVR successfully estimated 84% of the experimental databank with an RE lower than 5 kg/m³. Only 16% of the gathered database has been estimated with a residual error higher than 5 kg/m³. All these observations confirm the excellent agreement between the densities calculated by the LSSVR and the corresponding measurements.
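Percentages of this kind can be obtained by counting how many samples have an absolute residual error below a given threshold. The sketch below assumes RE is the absolute difference between the experimental and predicted density, in line with the "magnitude of the difference" wording. The four density values are made up and do not come from the databank.

```python
import numpy as np


def fraction_within(rho_exp, rho_pred, threshold):
    """Fraction of samples whose absolute residual error is below threshold."""
    residual = np.abs(np.asarray(rho_exp, float) - np.asarray(rho_pred, float))
    return float(np.mean(residual < threshold))


# Made-up densities in kg/m^3; the residuals are 1, 3, 6, and 0.
rho_exp_demo = [1100.0, 1150.0, 1200.0, 1250.0]
rho_pred_demo = [1101.0, 1153.0, 1206.0, 1250.0]
share_below_2 = fraction_within(rho_exp_demo, rho_pred_demo, 2.0)
share_below_5 = fraction_within(rho_exp_demo, rho_pred_demo, 5.0)
```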
Checking the reliability of the gathered database. The gathered experimental data played a central role in the development, validation, and selection of the machine learning methods described above. Furthermore, this experimental databank has been used to compare the accuracy of the empirical correlations and the selected LSSVR. All of the previous findings are valid only if the gathered laboratory-measured densities have an acceptable validity level. The leverage method is a well-trusted technique for detecting both valid and outlier data in an experimentally measured database [94]. This technique plots the standardized residuals (SR) against the Hat index [89]. Equation (27) explains that the SR is obtained by subtracting the average residual error (RE_ave) from each residual error and dividing the result by the standard deviation (SD) of the residual error. Equations (25) and (26) give the RE_ave and SD formulas, respectively.
Furthermore, numerical values of the Hat index (HI) can be obtained by applying Eq. (28) to the matrix of the independent variables (ξ) [95]. The superscripts T and −1 stand for the transpose and inverse operations, respectively. Figure 6 shows the plot of SR versus the HI values associated with the DES density databank. The leverage method states that the region bounded by −3 < SR < +3 and HI lower than the critical leverage is the valid domain, and all other positions belong to the suspect domain [96]. Equation (29) calculates the critical leverage (CL) from the number of independent variables (NIV) and the number of experimental data points (n) [83,95]. With four independent variables and 1239 data points, the CL equals 0.0121.
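The leverage check can be sketched as below. Since Eqs. (25) to (29) are not reproduced in this copy, the code relies on the standard forms: the Hat index as the diagonal of ξ(ξᵀξ)⁻¹ξᵀ, the critical leverage as 3(NIV + 1)/n, which reproduces the 0.0121 quoted above, and the SR as the centred residual divided by its standard deviation. The use of the sample standard deviation and the random demonstration matrix are assumptions.

```python
import numpy as np


def hat_index(X):
    """Diagonal of X (X^T X)^-1 X^T, computed without forming the n x n matrix."""
    X = np.asarray(X, dtype=float)
    gram_inv = np.linalg.inv(X.T @ X)
    return np.einsum("ij,jk,ik->i", X, gram_inv, X)


def critical_leverage(niv, n):
    """Critical leverage from the number of inputs and of data points."""
    return 3.0 * (niv + 1) / n


def standardized_residuals(residual_error):
    """Centre the residuals and scale by their sample standard deviation."""
    re = np.asarray(residual_error, dtype=float)
    return (re - re.mean()) / re.std(ddof=1)


def valid_mask(sr, hi, cl):
    """True for points inside -3 < SR < +3 with HI below the critical leverage."""
    return (np.abs(sr) < 3.0) & (hi < cl)


cl_paper = critical_leverage(niv=4, n=1239)   # values quoted in the text

# Random stand-in data, only to exercise the functions.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(50, 4))
hi_demo = hat_index(X_demo)
sr_demo = standardized_residuals(rng.normal(size=50))
mask_demo = valid_mask(sr_demo, hi_demo, critical_leverage(4, 50))
```

Points for which the mask is False fall in the suspect domain and would be flagged as possible outliers.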
The leverage method indicates that 1210 out of 1239 data points lie in the valid zone, and only 29 density samples may be outlier measurements. The validity of the gathered database is therefore confirmed, and all previous findings based on this databank are trustworthy.
LSSVR accuracy for predicting the density of each deep eutectic solvent. It is useful to monitor the prediction accuracy of the LSSVR for groups of deep eutectic solvents that share the same HBA agent. Since the average relative deviation (ARD, Eq. 30) [97] reveals both underestimated and overestimated predictions, it has been selected to measure the LSSVR accuracy in this stage. Figure 7 shows that the density of the thirteen classes of deep eutectic solvents with HBA #1 to HBA #13 (see Table 1) has been estimated with ARD values ranging from −0.24% to +0.17%. The deep eutectic solvents with HBA #1, 9, and 13 have been underestimated by the LSSVR. On the other hand, the DESs with HBA #3, 5, and 12 have been overestimated. The ARD% associated with the other deep eutectic solvent classes is almost equal to zero.
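A signed average relative deviation per HBA class could be computed along the following lines. Equation (30) is not available in this copy, so the sign convention used here, (predicted − experimental)/experimental with negative values meaning underestimation, is an assumption, as are the class labels and density values.

```python
import numpy as np


def ard_percent(rho_exp, rho_pred):
    """Signed average relative deviation in percent (negative = underestimate)."""
    rho_exp = np.asarray(rho_exp, dtype=float)
    rho_pred = np.asarray(rho_pred, dtype=float)
    return float(100.0 * np.mean((rho_pred - rho_exp) / rho_exp))


def ard_by_group(groups, rho_exp, rho_pred):
    """ARD% computed separately for every group label (e.g., HBA number)."""
    groups = np.asarray(groups)
    rho_exp = np.asarray(rho_exp, dtype=float)
    rho_pred = np.asarray(rho_pred, dtype=float)
    return {
        g.item(): ard_percent(rho_exp[groups == g], rho_pred[groups == g])
        for g in np.unique(groups)
    }


# Hypothetical HBA labels and densities (kg/m^3), for illustration only.
ard_demo = ard_by_group(
    groups=[1, 1, 2, 2],
    rho_exp=[1000.0, 2000.0, 1000.0, 1000.0],
    rho_pred=[990.0, 1980.0, 990.0, 1010.0],
)
```

In the second made-up class the two deviations cancel, which shows why a signed ARD close to zero does not by itself imply small individual errors.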
Investigating the effect of temperature and HBD/HBA types. The effect of temperature on the density of deep eutectic solvents with a specific HBA agent (i.e., choline chloride) and different HBD substances can be deduced from Fig. 8. This figure reports both the experimentally measured densities and their counterparts simulated by the LSSVR, and it demonstrates an excellent agreement between experimental and predicted density values. The LSSVR effectively discriminates between the effects of HBD type and working temperature on the density of the choline chloride-based DESs and accurately estimates all distinct data points. As with conventional liquids, the density of deep eutectic solvents decreases with increasing working temperature. The increase in the intermolecular void volume of the DES with increasing temperature has been proposed as the cause of this observation [98].
The density variation of deep eutectic solvents with the temperature and HBA type has been exhibited in Fig. 9. All DESs in this analysis have glycerol as their HBD agent. A high level of compatibility between actual density values and their counterparts estimated by the LSSVR can be seen in Fig. 9. The LSSVR distinguishes the effect of HBA type and temperature on the DES's density and accurately anticipates all individual density data points.

Simple flowchart of our study
A simple flowchart of the stages followed in the current research study is presented in Fig. 10. This figure can be broken down into four distinct parts as follows:
1. Developing machine learning methods
2. Comparing the accuracy of the machine learning methods and empirical correlations
3. Selecting the model with the highest prediction accuracy
4. Utilizing the chosen model for further analysis

Conclusion
The accuracy of seven machine learning methods and four empirical correlations has been compared to find the most accurate tool for estimating the density of 149 deep eutectic solvents. Extensive statistical analyses showed that the least-squares support vector regression equipped with the Gaussian kernel function is more accurate than the other methods investigated. This suggested scheme predicted 1239 experimentally measured [...]